Assessing Avocado Pricing Dynamics Utilizing Climate, Transportation Cost, and Macroeconomic Metrics in California

Problem Statement:

In recent years, avocados have surged in popularity across social media platforms, driving a sharp increase in demand and making them one of the most popular fruits globally. Mordor Intelligence (Avocado Market Insights, n.d.) forecasts that the avocado market will grow from USD 22.69 billion in 2024 to USD 35.55 billion by 2029. Additionally, the USDA reports annual per capita avocado consumption exceeding 8 pounds in the North American market, with local production and imports serving as the primary sources of supply (Avocado Market Insights, n.d.).

Given the perishable nature of avocados, maintaining fruit quality heavily depends on a robust supply chain that efficiently manages harvesting, transportation, and distribution processes. As avocado demand continues to escalate, understanding the driving factors behind market expansion becomes imperative.

This project focuses on California, a leading region in both avocado cultivation and consumption, with the specific aim of identifying primary drivers influencing avocado price fluctuations within the state. By analyzing critical components of the avocado supply chain—production, transportation, and consumer behavior—the objective is to develop a robust regression model that not only forecasts future avocado prices but also highlights the importance of each factor influencing these prices. The goal is to achieve a comprehensive understanding of avocado pricing dynamics by incorporating features representing various stages of the supply chain.

Steps:

  1. Data Collection:

    • Production Data: Collect climatic data from key production regions (Uruapan, Michoacan, Mexico, and Fallbrook, California) from Weather of the World, covering temperature, precipitation, humidity, wind speed, atmospheric pressure, and other relevant factors.
    • Transportation Cost Data: Gather energy consumption expenses, including unit prices of electricity, petroleum, and natural gas in California from the U.S. Energy Information Administration (EIA).
    • Consumer and Economic Data: Obtain economic variables such as the Consumer Price Index, Personal Consumption Expenditures, Producer Price Index, Inflation Rate, and Unemployment Rate from the Federal Reserve Economic Data (FRED).
    • Target Variable: Collect weekly avocado price data for California from the Kaggle dataset "Avocado Prices and Sales Volume 2015-2023."
  2. Data Examination:

    • Identify and handle missing values and outliers within the dataset.
    • Aggregate the weekly data to monthly data points by computing monthly averages for model development.
    • Review correlations between different attributes to understand interdependencies and refine feature selection for model development.
  3. Data Description:

    • Provide detailed descriptions of the collected data for each component: Avocado Price Data, Plantation Data (Uruapan and Fallbrook), California Energy Data, and Economic Data.
  4. Exploratory Data Analysis (EDA):

    • KDE Plots: Create KDE plots for each attribute to visualize their distributions.
    • Correlation Analysis: Generate correlation heatmaps to identify relationships between attributes.
    • Pairwise Distribution: Plot pairwise distributions of attributes to examine their interactions.
  5. Model Development:

    • Feature Engineering: Select and engineer relevant features based on EDA findings.
    • Regression Model: Develop a regression model to forecast avocado prices, incorporating features representing various stages of the supply chain.
  6. Model Evaluation:

    • Evaluate the model's performance using appropriate metrics such as RMSE, MAE, and R-squared.
    • Interpret the model to understand the importance of each factor influencing avocado prices.
  7. Conclusion:

    • Summarize findings and insights derived from the model.
    • Discuss the implications of these findings for avocado pricing dynamics and potential applications in supply chain management.
  8. Report Writing:

    • Compile the findings, methodologies, and conclusions into a comprehensive project report.
    • Include visualizations, data descriptions, and model interpretations to support the analysis.

By following these steps, the project aims to provide a detailed understanding of the factors driving avocado price fluctuations in California, ultimately aiding in better forecasting and supply chain management strategies.
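
The monthly aggregation in step 2 can be sketched as follows. This is a minimal standard-library illustration, not the project's actual code: the function name `monthly_average` and the sample prices are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average(weekly_prices):
    """Average weekly (date, price) observations into one value per calendar month."""
    buckets = defaultdict(list)
    for day, price in weekly_prices:
        # Group by (year, month) so weeks from different years never mix.
        buckets[(day.year, day.month)].append(price)
    return {month: mean(prices) for month, prices in sorted(buckets.items())}


# Hypothetical weekly prices, for illustration only.
weekly = [
    (date(2015, 1, 4), 1.00),
    (date(2015, 1, 11), 1.20),
    (date(2015, 2, 1), 1.50),
]
monthly = monthly_average(weekly)
```

Here `monthly` maps `(year, month)` keys to the mean of that month's weekly prices. With separate organic and conventional series, the same function would be applied to each series in turn.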

Part I: EDA

Avocado Price Data

Economic Data

Fallbrook Weather Data

Energy Data in California

Summary of Data Review

The analysis of 216 monthly avocado price points spanning from January 2015 to December 2023, encompassing both organic and conventional avocados, yielded the following insights:

  1. Data Completeness: No null values were found across all fields within the dataset.

  2. Price Stationarity:

    • Visual inspection of price point distribution plots suggests stationarity.
    • An Augmented Dickey-Fuller (ADF) test gives a p-value of approximately 0.028, below the chosen significance level of 0.05, so the null hypothesis of a unit root is rejected, indicating that the price series is stationary.
  3. Plantation Data Evaluation:

    • Kernel Density Estimation (KDE) plots of the plantation data from Fallbrook and Uruapan, drawn with automatic bandwidth selection, suggest approximately normally distributed attributes with no apparent outliers.
  4. Energy Data Evaluation:

    • A similar KDE evaluation of the California energy data suggests approximately normally distributed attributes with no apparent outliers.
  5. Economic Data Correlation:

    • An attribute correlation heatmap of the economic data reveals strong correlations between certain variables.
    • Notably, Unemployment_Level correlates highly with Unemployment_Rate; Average_Hourly_Earnings_of_All_Employees correlates with Personal_Consumption_Expenditures; and Employed_Persons_in_California correlates with Labor_Force_Participation_Rate_for_California.
    • Consequently, Unemployment_Level, Average_Hourly_Earnings_of_All_Employees, and Employed_Persons_in_California are excluded from the model development process.
  6. Economic Factors Manipulation:

    • Economic factors are treated as leading indicators, meaning that changes in avocado prices may lag behind changes in economic factors.
    • For example, a rise in unemployment may not affect avocado prices within the same month but could affect prices two months later.
    • Considering this lag effect, each avocado price point incorporates economic factors not only from the month of avocado pricing but also from two months prior.
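
The two-month lag described in item 6 can be sketched as follows. This is an illustrative helper under stated assumptions: the function name `add_lagged_feature`, the record layout, and the sample values are hypothetical, while `Unemployment_Rate` is one of the attributes named above.

```python
def add_lagged_feature(rows, column, lag=2):
    """Return copies of monthly rows with `column` from `lag` months earlier added.

    `rows` must be sorted by month with no gaps. The first `lag` rows have no
    earlier value available and are dropped.
    """
    lagged_name = f"{column}_lag{lag}"
    out = []
    for i in range(lag, len(rows)):
        row = dict(rows[i])  # copy so the input rows stay unchanged
        row[lagged_name] = rows[i - lag][column]
        out.append(row)
    return out


# Hypothetical monthly records, for illustration only.
months = [
    {"month": "2015-01", "Unemployment_Rate": 6.9, "price": 1.10},
    {"month": "2015-02", "Unemployment_Rate": 6.8, "price": 1.15},
    {"month": "2015-03", "Unemployment_Rate": 6.6, "price": 1.21},
    {"month": "2015-04", "Unemployment_Rate": 6.5, "price": 1.30},
]
features = add_lagged_feature(months, "Unemployment_Rate", lag=2)
```

Each resulting row keeps the same-month value and gains the value from two months earlier, so the March row carries both the March and the January unemployment rate.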

Model Development

Methodology and Model Development

Linear regression is a common choice for estimating a target such as avocado price from a set of input features. The model predicts the target as a linear combination of the features, each weighted by its coefficient. More advanced variants map the input attributes to higher-dimensional feature spaces and add interaction terms to capture synergy effects. Local fitting methods, such as locally weighted regression, can further improve accuracy.
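
As a sketch of the notation (the symbols here are introduced for illustration and do not appear in the original analysis), with $p$ input features $x_1, \dots, x_p$, intercept $\beta_0$, and coefficients $\beta_j$, the basic model is

$$
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
$$

and adding pairwise interaction terms with coefficients $\beta_{jk}$ gives

$$
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \sum_{1 \le j < k \le p} \beta_{jk} \, x_j x_k
$$

The second form is still linear in its coefficients, so it can be fitted with the same least-squares machinery once the products $x_j x_k$ are added as extra columns.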

In the development of a linear regression model, selecting the most important features is crucial. Three commonly used feature selection techniques in linear regression are Forward Selection, Backward Elimination, and Recursive Feature Elimination. Below are descriptions of each technique:

Forward Selection:

  1. Begins with an empty set of selected features.
  2. Iteratively selects the best feature to add to the selected set based on improvements in model performance.
  3. Continues this process until the desired number of features is selected.
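
The forward selection steps above can be sketched generically. In this minimal illustration, `score` stands in for "fit a model on this subset and return a higher-is-better metric"; the feature names and the toy scoring function are hypothetical.

```python
def forward_selection(features, score, n_select):
    """Greedy forward selection.

    `score(subset)` returns a number where higher is better, for example a
    validation R^2 of a model fitted on `subset`.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_select:
        # Try each remaining feature and keep the one that scores best.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scoring function (hypothetical): each feature has a fixed usefulness,
# and two redundant features together incur a penalty.
usefulness = {"type_organic": 0.5, "PCE_lag2": 0.3, "temp": 0.1, "PCE": 0.29}


def toy_score(subset):
    total = sum(usefulness[f] for f in subset)
    if "PCE" in subset and "PCE_lag2" in subset:
        total -= 0.25  # the redundant pair adds little new information
    return total


chosen = forward_selection(list(usefulness), toy_score, n_select=3)
```

Because the toy score penalizes the redundant pair, the third pick is `temp` rather than `PCE`, even though `PCE` looks more useful on its own.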

Backward Elimination:

  1. Starts with all features included in the selected features set.
  2. Iteratively removes the feature whose removal degrades model performance the least.
  3. Continues until the number of selected features reaches the minimum desired number.
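
Backward elimination mirrors this from the other direction. A minimal sketch, again with `score` as a placeholder for a fitted model's higher-is-better metric and with hypothetical toy values:

```python
def backward_elimination(features, score, n_keep):
    """Greedy backward elimination; `score(subset)` is higher-is-better."""
    selected = list(features)
    while len(selected) > n_keep:
        # Drop the feature whose removal leaves the best-scoring subset.
        drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(drop)
    return selected


# Toy score (hypothetical): sum of fixed per-feature contributions.
contribution = {"a": 0.40, "b": 0.05, "c": 0.30, "d": 0.01}
kept = backward_elimination(
    list(contribution),
    lambda subset: sum(contribution[f] for f in subset),
    n_keep=2,
)
```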

Recursive Feature Elimination:

  1. Trains the model on all currently selected features.
  2. Ranks the features by importance, for example by the magnitude of their coefficients.
  3. Removes the least important feature (or a fixed number of the least important features).
  4. Repeats the process on the remaining features until the desired number of features is reached.
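
Recursive feature elimination is usually described as fitting on all surviving features, dropping the lowest-ranked one, and repeating. A minimal sketch of that loop follows; `rank` is a placeholder for "fit a model and return per-feature importances", and the feature names and importance values are hypothetical.

```python
def recursive_feature_elimination(features, rank, n_keep):
    """Repeatedly refit on the surviving features and drop the weakest one.

    `rank(subset)` stands in for fitting a model on `subset` and returning a
    dict of per-feature importances (for a linear model, absolute coefficients).
    """
    selected = list(features)
    while len(selected) > n_keep:
        importances = rank(selected)  # re-ranked on every round
        weakest = min(selected, key=lambda f: importances[f])
        selected.remove(weakest)
    return selected


# Toy ranking (hypothetical): fixed raw weights, normalized over the subset.
base = {"type": 5.0, "PCE_lag2": 3.0, "humidity": 0.4, "wind": 0.2}


def toy_rank(subset):
    total = sum(base[f] for f in subset)
    return {f: base[f] / total for f in subset}


kept = recursive_feature_elimination(list(base), toy_rank, n_keep=2)
```

Unlike forward selection and backward elimination, this loop needs no separate performance metric: it relies on the importances the fitted model itself reports.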

For this project, linear regression with feature selection is not pursued. Although linear regression is more interpretable than most other regression models, it may not capture the complex nonlinear relationships inherent in avocado prices. Instead, a regression tree and a random forest are developed using 80% of the input data for training, with the remaining 20% held out to evaluate both models.
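
The 80/20 split can be sketched as below. The source does not say whether the split was random or chronological; a seeded random shuffle is assumed here, and the helper name `train_test_split_rows` is hypothetical.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 216 monthly price points, as in the data review; the values are placeholders.
rows = list(range(216))
train, test = train_test_split_rows(rows)
```

With 216 rows this leaves 173 for training and 43 for testing. For a price series, a chronological split (training on earlier months and testing on later ones) is a common alternative that avoids evaluating on months that sit between training months.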

Part II: Model Development

Classification and Regression Tree

Random Forest

Conclusion

This project compared two regression models, a Regression Tree and a Random Forest, for predicting avocado prices. Both models were assessed on the same dataset, with a focus on interpretability, generalizability, and feature importance.

Regression Tree Analysis

The Regression Tree model, while intuitive and capable of capturing complex relationships, demonstrated limitations in generalizing to unseen data. The model achieved a training $R^2$ score of 0.827, indicating a good fit to the training dataset. However, its testing $R^2$ score of 0.589 suggested a considerable drop in performance on new data, highlighting its susceptibility to overfitting.
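
For reference, $R^2$ and the other evaluation metrics named in the project steps (RMSE and MAE) can be computed as in the sketch below. The function name and the sample values are hypothetical and are not the project's data.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical prices, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(actual, predicted)
```

Computing the same metrics on the training and testing sets separately is what exposes the gap discussed above: a training $R^2$ of 0.827 against a testing $R^2$ of 0.589 is a difference of 0.238.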

Random Forest Analysis

In contrast, the Random Forest model addressed the overfitting issue by aggregating predictions from multiple decision trees. Tuned through grid search cross-validation, the optimal configuration of min_samples_leaf = 3 and n_estimators = 60 yielded robust results. The Random Forest achieved a high training $R^2$ score of 0.947 and a testing $R^2$ score of 0.798, showcasing superior generalizability and less sensitivity to overfitting compared to the Regression Tree.
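
The grid search idea can be sketched independently of any particular library. In this minimal illustration, `evaluate` is a placeholder for a cross-validated score of a model built with the given parameters. The candidate values and the toy score surface are hypothetical, since the source reports only the winning pair.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score).

    `evaluate(params)` stands in for a cross-validated score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are hypothetical; the source reports only the winning pair.
grid = {"min_samples_leaf": [1, 3, 5], "n_estimators": [20, 60, 100]}


def toy_cv_score(params):
    # Made-up score surface that peaks at min_samples_leaf=3, n_estimators=60.
    return -abs(params["min_samples_leaf"] - 3) - abs(params["n_estimators"] - 60) / 40


best_params, best_score = grid_search(grid, toy_cv_score)
```

In practice, `evaluate` would train a random forest with the given parameters on each cross-validation fold and return the mean validation score.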

Feature Importance Insights

Feature importance analysis revealed the key drivers of avocado prices. Organic designation had a strong effect on price, consistent with the price premium that organic produce commands. Changes in personal consumption expenditures influenced prices with a two-month lag, highlighting the delayed impact of economic factors on avocado markets. Additionally, local weather conditions and electricity prices in California played minor yet discernible roles in price fluctuations.

Conclusion and Implications

In conclusion, while the Regression Tree provides transparency in decision-making, its limited generalizability compromises its utility for robust predictions on unseen data. The Random Forest model, leveraging ensemble learning, offers superior predictive performance and feature interpretability, making it suitable for practical applications where accuracy and resilience to overfitting are paramount.

This study's findings suggest that leveraging Random Forest models can enhance the accuracy of price predictions in agricultural markets, with implications for broader analyses of food price dynamics. Future research could expand this methodology to explore additional economic and environmental variables impacting food prices across regions.